EB3I n1 2025 scRNAseq
-
Differential expression analysis
EB3I n1 2025 scRNAseq
-
Differential expression analysis
| # Start Rstudio |
| - Using the OpenOnDemand/Rstudio cheat sheet, connect to the OpenOnDemand portal and create a Rstudio session with the right resource requirements. |
| # Warm-up |
| - We set common parameters we will use throughout this session : |
1 Prepare the data structure
We will do the same as for former steps, just changing the session name :
1.1 Main directory
1.2 Current session
1.3 Input directory
1.4 Output directory
2 Reload the Seurat Object
- We can reload the object we saved at the former step
3 Forewords
I need three teams for this session:
- Team Wilcoxon
- Team Student
- Team [MAST]
I will also need your help because I can’t make DESeq2 work correctly.
But I’m sure, that we will solve my issue: you’re in the best session here at EB3I.
3.1 TLDR: R command lines
In this presentation, there will be screen captures for you to follow the lesson. There will also be every single R command lines.
Do not take care of the command lines if you find them too challenging.
Our goal here, is to understand the main mechanism of Differential Expression Analysis. R is just a tool.
Below are the libraries we need to perform this whole session:
base::library(package = "BiocParallel") # Optionally multithread some steps
base::library(package = "DT") # Display nice tables in HTML
base::library(package = "ggplot2") # Draw nice graphs and plots
base::library(package = "ggpubr") # Draw even nicer graphs
base::library(package = "rstatix") # Base R statistics
base::library(package = "knitr") # Build this presentation
base::library(package = "dplyr") # Handle big tables
base::library(package = "Seurat") # Handle SingleCell analyses
base::library(package = "SeuratObject") # Handle SingleCell objects for Seurat
base::library(package = "SingleCellExperiment") # Handle SingleCell file formats
base::library(package = "UpSetR") # Nice venn-like graphs
base::library(package = "EnhancedVolcano") # Draw Volcano plotThen we join layers:
Then we perform differential expression analysis:
# run_dea
sobj_de <- Seurat::FindMarkers(
# Object on which to perform DEA
object = sobj,
# Name of factor in condition 1
ident.1 = "TD3A",
# Name of factor in condition 2
ident.2 = "TDCT"
)And that’s all ! Our goal is to understand these lines, being able to write them is a bonus.
3.2 Purpose of this session
Up to now, we have:
- Identified to which cell each sequenced reads come from
- Identified to which gene each read come from
- Identified possible bias in gene expression for each cell
- Filtered and corrected these bias as well as we can
We would like to identify the list of genes that characterize differences between cell cycle phases G1 and S groups.
At the end of this session you will know:
- how to select a differential analysis method
- how to select the correct normalization (if any?) that must be provided to your differential analysis method
- How to read differential expression results
3.3 Insight
We are wondering what’s in our dataset. Let’s have a look,
with the function print
form the package base.
Show output
An object of class Seurat
12926 features across 3886 samples within 1 assay
Active assay: RNA (12926 features, 2000 variable features)
3 layers present: data, counts, scale.data
6 dimensional reductions calculated: pca, CCAIntegration, umap, RPCAIntegration, HarmonyIntegration, HarmonyStandalone
We have 0 features (aka genes), across 0 samples (aka cells) in the TD3A condition. We have 0 features (aka genes), across 0 samples (aka cells) in the TD3A condition.
Let us have a look at the RNA counts for 10 cells and their annotation,
with the function head from the package utils.
# head_seurat_counts_phase
utils::head(
# Object to visualize
x = sobj,
# Number of lines to display
n = 10
)Prettier with a table in HTML :
There are no counts, normalized or not !!
Where are they ?
In order to explore the content of sobj, use the function str from the package utils:
Alternatively, in RStudio, you can click on the object pane and explore manually
the content of the object. If we explore the slot assays, then we find the counts.
You can access them with:
# head_seurat_count_table
utils::head(
# Object to visualize
x = SeuratObject::GetAssayData(object = sobj, assay = "RNA", layer = "counts"),
# Number of rows to display
n = 10
)Prettier with a table in HTML :
We have one gene per line, one cell per column, and RNA counts in each row.
For the sake of this session, we won’t compare the whole dataset. It would take up to 15 minutes to complete. During the rest of this session, we will compare the Louvain clusters 8 and 10, from our integration with Harmony.
We need to re-annotate layers to do so. This is done in two steps:
- Redefine
Identsin order to be able to call cells by their cluster names. JoinLayersto have a single count table with both TD3A and TDCT cells from clusters 8 and 10 together.
# ident_and_join_layers
Seurat::Idents(sobj) <- sobj$HarmonyStandalone_clusters
sobj <- SeuratObject::JoinLayers(sobj)Let’s check if our cells are joint:
Show output
An object of class Seurat
12926 features across 3886 samples within 1 assay
Active assay: RNA (12926 features, 2000 variable features)
3 layers present: data, counts, scale.data
6 dimensional reductions calculated: pca, CCAIntegration, umap, RPCAIntegration, HarmonyIntegration, HarmonyStandalone
Question :
We have one gene per line, one cell per column, and RNA counts in each row.
# a_normscalfilt ## . These counts are normalized, scaled, filtered for sure. ## ## . This information is available in the seurat object itself, ## within the slot `@commands`.See an example below:
# a_normscalfilt ## . These counts are normalized, scaled, filtered for sure. ## ## . Total counts, max value, counts per cell and number of ## expressed features per cell, all have decreased. ## ## . This information is available in the seurat object itself, ## within the slot `@commands`. See an example below:Show output
[1] "NormalizeData.RNA" [2] "FindVariableFeatures.RNA" [3] "ScaleData.RNA" [4] "RunPCA.RNA" [5] "RunUMAP.RNA.CCAIntegration" [6] "FindNeighbors.RNA.CCAIntegration" [7] "RunUMAP.RNA.RPCAIntegration" [8] "FindNeighbors.RNA.RPCAIntegration" [9] "RunUMAP.RNA.HarmonyIntegration" [10] "FindNeighbors.RNA.HarmonyIntegration" [11] "RunUMAP.RNA.HarmonyStandalone" [12] "FindNeighbors.RNA.HarmonyStandalone" [13] "FindClusters"However, please remember that :
- Counts in the slot
countare raw counts - Normalized counts are in the slot
data - scaled data are in the slot
scaled.data(this one was damn obvious).
And it you do not find that crystal-clear, I totally agree with you.
- Counts in the slot
Raw counts have many, so many zeros …
# a_sozero ## . The large number of null counts is completely normal for this ## "droplet" technology. In maths/stats, we talk about "matrix sparcity", ## i.e a table with lots (lots!) of zeros.And count values are so low …
4 Select a DE method
4.1 Available methods
Seurat let us use multiple
differential analysis methods with its function FindMarkers.
- wilcox: The wilcoxon test tests the mean of expression and looks for a difference in these means.
- MAST: This tool has been built for Single Cell. It is based on a statistical model called “Hurdle Model”, which excells with data that contains lots of zeros (which is our case in Single Cell RNA-Seq: most of the genes are not expressed.)
- DESeq2: This tool has originally been built for bulk RNA-Seq but now includes specific functions for Single Cell. It performs well when counts are highly variable or when you wand to compare a handful of cells.
- t-test: The t-test uses a comparison of means to assess differential expression probability.
- roc: In this method, an AUC value of
1means that expression values for this gene alone can perfectly classify the two groupings, i.e : each of the cells in “cells.1” (condition 1) exhibit a higher level than each of the cells in “cells.2” (condition 2). An AUC value of0also means there is perfect classification, but in the opposite direction (higher in condition 2 than in 1). A value of0.5implies that the gene has no predictive power to classify the two groups.
The main question now is how to choose the right test ??
Spoilers : there are no option better than another in all ways ! Thank you for your attention, you can go back home now.
From Soneson & Robinson (2018) Nature Methods:
Here, researchers have designed an artificial dataset where they knew in advance the list of differentially expressed genes. They have used all these algorithms and consigned the results.
DESeq2,Limmaseem to have a higher number of false positives (genes identified as differentially expressed while they were actually not.)Wilcoxonseems to be better in general.MASTperforms well in absolute
ANOVA was not present in this study (but probably would have performed similarly to t-test.
4.2 A case: the Plac8 gene
The question is now to guess whether this gene is differentially expressed, or not.
4.2.1 Cell observations
Let’s have a look at the gene named ‘Plac8’
It is known to be involved in positive regulation of cold-induced thermogenesis, and positive regulation of transcription by RNA polymerase II.
In order to plot its expression across all cells, we are going to use the
function VlnPlot
from the Seurat package.
The input object is obviously contained in the sobj variable, since it is
our only Seurat object. In addition, we are going to select the feature Plac8,
and split the graph according to the clusters we annotated earlier.
# seurat_vlnplot_Plac8_demo
Seurat::VlnPlot(
# A subset of the Seurat object
# limited to clusters 8 and 10,
# or else we will plot all the clusters
object = subset(sobj, HarmonyStandalone_clusters %in% c("8", "10")),
# The name of the gene of interest (feature = gene)
features = "Plac8",
# The name of the Seurat cell annotation
split.by = "HarmonyStandalone_clusters",
# Change color for presentation
cols = c("darkslategray3", "olivedrab")
)Show plot
# q_isplac8de
## Using your intuition, is this gene differentially expressed
## between clusters 8 and 10 ?# q_isplac8de
## . In cluster 10, the violin plot highlights almost no cells with low or
## zero `Plac8` expression. The highest density of cells has a `Plac8`
## normalized expression aroung ~1.5.
##
## . In cluster 8, cells seem to have no expression of `Plac8` at all.
##
## . IMHO (feel free to disagree), the expression of the gene `Plac8` differs
## between our clusters 8 and 10. This is purely intuitive.Now, let’s check the same gene expression, but comparing cells tagged as in the G1 and S phase.
# vlnplot_seurat_group_phase_code
Seurat::VlnPlot(
# The Seurat object
object = sobj,
# The name of the gene of interest (feature = gene)
features = "Plac8",
# The name of the Seurat cell-cycle annotation
group.by = "CC_Seurat_Phase",
# Change color for presentation
cols = c("darkslategray3", "olivedrab", "orangered3")
)Show plot
# q_isplac8deG1S
## Using your intuition, is this gene differentially expressed
## between the G1 and S cell cycle phases ?# a_isplac8deG1S
## . This one seems more tricky... One can observe that most of the expression
## is null, some cells express the gene.
##
## . We need to investigate a bit deeper !Okay, let’s get some information about these expression distributions.
# general_count_table
## Store counts in a variable
counts <- base::as.data.frame(
# The matrix to reformat into a dataframe
x = SeuratObject::GetAssayData(
object = sobj,
assay = "RNA",
layer = "data")
)
## Rename cells with cell harmony cluster
base::colnames(counts) <- paste(
# The names of the cell cluster for each cell
sobj$CC_Seurat_Phase,
# The names of the cells themselves
colnames(sobj),
sep = "_"
)Let’s check our counts variable :
4.2.2 From biology to statistics
Okay, let us resort on statistics to evaluate our chances to be guess correctly, or our risks to guess wrong.
We have lots of observations :
- 2838 cells within the G1 phase
- 711 cells within the S phase
Statisticians really like to have a lot of observations !
Ideally, statisticians never have enough observations, but at least they prefer when they have more than tests.
Here, we have a total of 3549 observations and we are testing a single gene.
We live in a statistician’s wet dream !
Now, if we think like a statistician, we can ask ourselves :
- Are our cells supposed to be interacting with each others ?
- Or are they independent from each others ?
- Does the expression in our cells follow a normal distribution ?
All of this is very important, and usually, it requires a discussion.
For the normal distribution, there is an easy check. Let’s draw the expression like we did above.
First, we use the function rbind
from base package.
Be careful, the function rbind also exists
in DelayedArray, data.table, and BiocGenerics packages !
We want to use the “basic” one.
# distribution_Plac8_table_build
## Add column idientifiers in each count dataframes
Plac8G1$phase <- "G1"
Plac8S$phase <- "S"
## Paste the rows beneith each other
Plac8 <- base::rbind(
## variable pointing to G1 counts
Plac8G1,
## variable pointing to S counts
Plac8S,
## A Boolean, requesting that strings/characters
## should not be casted as `logical`. It breaks graphs.
stringsAsFactors = FALSE
)Secondly, we use the function gghistogram from the package ggpubr in order to display relative abundance of gene expression :
# distribution_Plac8_display
ggpubr::gghistogram(
Plac8,
x = "Plac8",
y = '..density..',
fill = "steelblue",
bins = 15,
add_density = TRUE
)Show plot
Our distribution doesn’t seem to be normal, nor binomial … we will have to rely on non-parametric tests.
Let’s run a non-parametric test based on the mean of distributions, since it’s the clothest to our ‘intuitive’ approach. Let there be Wilcoxon test.
In R, it’s quite straightforward : we have the function
wilcoxon_test
to perform the test, then we can plot the result.
# wilcoxon_Plac8
## On the expression table stored in the varialbe `Plac8`,
## first apply the function `wilcox_test` from package `rstatix`,
## then we apply the function `add_significance` from package `rstatix`
stat.test <- Plac8 %>% rstatix::wilcox_test(Plac8 ~ phase) %>% rstatix::add_significance()
# While doing so, we usually also compute effect size
eff.size <- Plac8 %>% rstatix::wilcox_effsize(Plac8 ~ phase)Wilcoxon test says: the distributions are different, with a 3^{-13}
% of risks of being wrong. The gene Plac8 can safely be said differentially
expressed.
We can compute a fold change and conclude.
NOTE : Just out of curiosity, be aware that t_test
from rstatix package, gives the same answer (with different significance).
However, DESeq gives
a p-value of 0.05 and an adjusted p-value equal to 1 !
Depending on the test used, this gene is, or is not differentially expressed.
4.3 Conclusion
In conclusion, to select your DEA method, use the following rules of thumb :
- If you have already done a study with one of these methods, keep using the same. This is crutial if you ever want to compare your new result with the old ones.
- If you want to compare your study with a published one, use the same method.
- If you have no idea, use Wilcoxon.
- If you have bulk data analyzed with
DESeq2/Limma, useDESeq2/Limma. It will be easier to take both works in consideration.
Please, never ever use a simple Wilcoxon on bulk RNA-Seq data !!
4.4 NB MAST DE
MAST uses a hurdle model divided into two parts:
- Discrete – models the probability that a feature is detected in a cell (present or not).
- Continuous – models the level of expression for the feature in cells where it is detected.
MAST also uses the Cellular Detection Rate (CDR) as a covariate to adjust for sequencing effects, calculated as the fraction of detected genes per cell (nFeature barcode >0 / nHVG). It is also possible to include additional covariates, such as batch or experimental metrics, when analyzing more than two samples.
5 Select a dataset
5.1 Dataset depends on selected method
There it is quite easier:
5.2 FindMarkers
With the function FindMarkers
from package Seurat, we want to make three groups:
- One using
wilcoxonto perform DEA between clusters “8” and “10”. - One using
t-test to perform DEA between clusters “8” and “10”. - One using
MASTto perform DEA between “8” and “10”.
We will observe the results and compare our gene lists.
Hey, why are you looking at me? It’s your turn to work! Use the all the
notions seen above to select the right counts (slot), the right input
object, and the right arguments.
10 minutes practice !
Answers
Here are the code for each team:
sobj_wilcox <- Seurat::FindMarkers(
# The variable that contains Seurat Object
object = sobj,
# Name of condition 1
ident.1 = "8",
# Name of condition 2
ident.2 = "10",
# Factor name in the Seurat Object
group.by = "HarmonyStandalone_clusters",
# Differential analysis method
test.use = "wilcox"
)
sobj_t <- Seurat::FindMarkers(
# The variable that contains Seurat Object
object = sobj,
# Name of condition 1
ident.1 = "8",
# Name of condition 2
ident.2 = "10",
# Factor name in the Seurat Object
group.by = "HarmonyStandalone_clusters",
# Differential analysis method
test.use = "t"
)
sobj_mast <- Seurat::FindMarkers(
# The variable that contains Seurat Object
object = sobj,
# Name of condition 1
ident.1 = "8",
# Name of condition 2
ident.2 = "10",
# Factor name in the Seurat Object
group.by = "HarmonyStandalone_clusters",
# Differential analysis method
test.use = "MAST"
)In the function argument, there is a FoldChange threshold. Should we filter gene expression based on FoldChange? In case of positive answer, how much should that threshold be?
Answer
About thresholds on FDR (False Discovery Rate) and Log2(FC) (Log of the Fold Change), there are many discussions.
Remember, here the threshold on Fold Change is Logged. A log2(1) =0. And keep in mind the following:
- If one selects a fold change threshold above 1.7, then their study concludes that smoking is not related to lung cancer.
- If one selects a fold change threshold above 1, then their study concludes that a fast-food based diet does not lead to weight gain.
- If one selects a fold change threshold above 1/25 000 000, then their study concludes: it is a complete hazard that mice have fetal malformation when in presence of Bisphanol.
In conclusion, there one, and only one reason to filter on fold change: in my experience, a fold change below 0.7 will be hard to see/verify on wet-lab (qRT).
If you need to reduce a too large number of differentially expressed genes, then reduce the FDR to 0.01, or even better, to 0.001. With that, you reduce your number of false claims.
Can you help me with DEseq2?
When I run the following command line, I have an error :
sobj_deseq2 <- Seurat::FindMarkers(
# The variable that contains Seurat Object
object = sobj,
# Name of condition 1
ident.1 = 8,
# Name of condition 2
ident.2 = 10,
# Factor name in the Seurat Object
group.by = "HarmonyStandalone_clusters",
# Differential analysis method
test.use = "deseq2",
# Use non-normalized data with DESeq2
slot = "counts"
)Error in PerformDE(object = object, cells.1 = cells.1, cells.2 = cells.2, : Unknown test: deseq2
Answer
Oh! My fault, it was a typo in my command! Thank you all for your help!
sobj_deseq2 <- Seurat::FindMarkers(
# The variable that contains Seurat Object
object = sobj,
# Name of condition 1
ident.1 = "8",
# Name of condition 2
ident.2 = "10",
# Factor name in the Seurat Object
group.by = "HarmonyStandalone_clusters",
# Differential analysis method
test.use = "DESeq2",
# Use non-normalized data with DESeq2
slot = "counts"
)Remark: by doing surch modification, some fold changes have been modified: remember the gene Atad2 with a mean expression of 0.08 in G1, and 0.2 in S phases? Mean expressions are now around 1.08 for G1, and 1.2 for S phases. This may be the reason why it was not differentially expressed in DESeq2, while Wilcoxon and T-test returned adjusted pvalues far below 0.05.
6 Explore results
6.1 Big tables
Let us have a look at the results:
We have the following columns:
p_val: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them.avg_log2FC: Average Log2(FoldChange). Illustrates how much a gene is differentially expressed between samples in each condition.pct.1: Percent of cells with gene expression in condition one, here in cluster 8.pct.2: Percent of cells with gene expression in condition two, here in cluster 10.p_val_adj: The confidence we have in the result. The closer to 0, the lesser is the risk of error.
We have the following columns:
p_val: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them.avg_log2FC: Average Log2(FoldChange). Illustrates how much a gene is differentially expressed between samples in each condition.pct.1: Percent of cells with gene expression in condition one, here in cluster 8.pct.2: Percent of cells with gene expression in condition two, here in cluster 10.p_val_adj: The confidence we have in the result. The closer to 0, the lesser is the risk of error.
We have the following columns:
p_val: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them.avg_log2FC: Average Log2(FoldChange). Illustrates how much a gene is differentially expressed between samples in each condition.pct.1: Percent of cells with gene expression in condition one, here in cluster 8.pct.2: Percent of cells with gene expression in condition two, here in cluster 10.p_val_adj: The confidence we have in the result. The closer to 0, the lesser is the risk of error.
We have the following columns:
p_val: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them.avg_log2FC: Average Log2(FoldChange). Illustrates how much a gene is differentially expessed between samples in each condition.pct.1: Percent of cells with gene expression in condition one, here in cluster 8.pct.2: Percent of cells with gene expression in condition two, here in cluster 10.p_val_adj: The confidence we have in the result. The closer to 0, the lesser is the risk of error.
6.2 Filter DEA results
What kind of threshold should be used to filter each results?
If we must label certain scores as good or bad, we can reference the following rule of thumb:
0.5 = No discrimination 0.5-0.7 = Poor discrimination 0.7-0.8 = Acceptable discrimination 0.8-0.9= Excellent discrimination 0.9 = Outstanding discrimination
Hosmer and Lemeshow in Applied Logistic Regression (p. 177)
6.3 Add results to Seurat objects
We’d like to store the results of differential expression analysis in
the Seurat object.
6.4 Common results
Now we can plot intersections in an up-set graph. It is like a Venn diagram:
Show plot
6.5 Others functions in Seurat for DEA
There are two other functions in Seurat package to do DEA in different context. For example, you will see the difference between one cluster versus many of them. You could be use FindAllMarkers. Usage is equivalent to FindMarkers but it need to indicate which variable cluster variable.
sobj_findallmarkers_mast <- FindAllMarkers(sobj,
assay="RNA",
group.by="HarmonyStandalone_clusters",
test.use = "MAST",
only.pos = T,
logfc.threshold = 0.9,
min.pct = 0.1,
random.seed = my_seed)
DT::datatable(
head(sobj_findallmarkers_mast, n = 10),
caption = "MAST results"
)p <- EnhancedVolcano(
toptable = sobj_findallmarkers_mast,
lab = rownames(sobj_findallmarkers_mast),
x = "avg_log2FC",
y = "p_val_adj",
FCcutoff = 0.2,
drawConnectors = FALSE,
title = 'Volcano by cluster'
)
p + facet_wrap(~ cluster)Show plot
Another function is FindConservedMarkers. It the same function of FindMarkers but this usage is to identify the same features expression between clusters
sobj_conservedmarkers_mast <- Seurat::FindConservedMarkers(
# The variable that contains Seurat Object
object = sobj,
# Name of condition 1
ident.1 = "8",
# Name of condition 2
ident.2 = "10",
# Factor name in the Seurat Object
grouping.var = "orig.ident",
# Differential analysis method
assay="RNA",
slot="counts"
)
DT::datatable(
head(sobj_conservedmarkers_mast, n = 10),
caption = "MAST results"
)TDCT_p_val: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them.TDCT_avg_log2FC: Average Log2(FoldChange). Illustrates how much a gene is differentially expessed between samples in TDCT condition.TDCT_pct.1: Percent of cells with gene expression in condition one, here in cluster 8 in TDCT condition.TDCT_pct.2: Percent of cells with gene expression in condition two, here in cluster 10 in TDCT condition.TDCT_p_val_adj: The confidence we have in the result. The closer to 0, the lesser is the risk of error in TDCT condition.TD3A_p_val: Ignore this column. Always ignore raw p-values. Look at corrected ones, and if they are missing, then compute them.TD3A_avg_log2FC: Average Log2(FoldChange). Illustrates how much a gene is differentially expessed between samples in TD3A condition.TD3A_pct.1: Percent of cells with gene expression in condition one, here in cluster 8 in TD3A condition.TD3A_pct.2: Percent of cells with gene expression in condition two, here in cluster 10 in TD3A condition.TD3A_p_val_adj: The confidence we have in the result. The closer to 0, the lesser is the risk of error in TD3A condition.max_pval: Maximum of p_val between comparison TDCT vs TD3A. It must be low to indicate a conserved featureminimump_p_val: Minimum of p_val between comparison TDCT vs TD3A.
Seurat::VlnPlot(
# A subset of the Seurat object
# limited to clusters 8 and 10,
# or else we will plot all the clusters
object = subset(sobj, HarmonyStandalone_clusters %in% c("8", "10")),
# The name of the gene of interest (feature = gene)
features = "Plac8",
# The name of the Seurat cell annotation
split.by = "orig.ident",
# Change color for presentation
cols = c("blue", "red")
)Show plot
sobj_conservedmarkers_mast$direction = case_when(
sign(sobj_conservedmarkers_mast$TDCT_avg_log2FC) == sign(sobj_conservedmarkers_mast$TD3A_avg_log2FC) &
sign(sobj_conservedmarkers_mast$TDCT_avg_log2FC) == -1 ~ "negative",
sign(sobj_conservedmarkers_mast$TDCT_avg_log2FC) == sign(sobj_conservedmarkers_mast$TD3A_avg_log2FC) &
sign(sobj_conservedmarkers_mast$TDCT_avg_log2FC) == 1 ~ "positive",
.default = "opposite")
t_dataframe <- table(sobj_conservedmarkers_mast$direction,sobj_conservedmarkers_mast$max_pval<0.05)[,2]
levels_direction <- paste0(names(t_dataframe)," (nFeature=",t_dataframe,")")
sobj_conservedmarkers_mast_filtered = sobj_conservedmarkers_mast[sobj_conservedmarkers_mast$minimump_p_val<0.05,]
sobj_conservedmarkers_mast_filtered$direction = factor(sobj_conservedmarkers_mast_filtered$direction,
levels=c("negative","opposite","positive"),
labels=levels_direction)
ggplot() +
geom_point(data=sobj_conservedmarkers_mast_filtered,
aes(x = TDCT_avg_log2FC, y = TD3A_avg_log2FC, color = direction)) +
scale_color_manual(values=c("lightblue","grey70","red")) +
theme_bw() +
labs(x = "TDCT Average Log2 FoldChange cluster 8 vs cluster 12",
y = "TD3A Average Log2 FoldChange cluster 8 vs cluster 12",
title = "Result of Wilcoxon Test between cluster 8 vs cluster 12 between condition TDCT vs TD3A",
color = "Direction Log2FoldChange") Show plot
6.6 Heatmap
We’d like to display the expression of genes identified by FindMarkers.
Then we use the function DoHeatmap
from the package Seurat.
In order to limit the graph to differentially expressed reads, we use the
function rownames
from R base package on the DEA result table. In this example, I use
the results of wilcoxon, but you shall use any of the results you previously
obtained.
Seurat::DoHeatmap(
# variable pointing to seurat object
object = sobj,
# name of DE genes
features = base::rownames(sobj_wilcox),
# Cluster annotation
group.by = "HarmonyStandalone_clusters",
)Show plot
6.7 Volcano plot
A Volcano plot is usefull to identify differnetial expression analysis bias.
The package EnhancedVolcano has an eponym
function for that:
EnhancedVolcano::EnhancedVolcano(
# variable pointing to the DEA results
toptable = sobj_wilcox,
# Gene names
lab = rownames(sobj_wilcox),
# Column in which to find Fold Change
x = "avg_log2FC",
# Column in which to find confidence interval
y = "p_val_adj",
# Lower fold-change cut-off
FCcutoff = 0.2
)Show plot
6.8 Session Info
This list of all packages used while you work should be included in each en every R presentation:
Show output
R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /home/mna_bioinfo/anaconda3/envs/r431/lib/libblas.so.3.9.0
LAPACK: /home/mna_bioinfo/anaconda3/envs/r431/lib/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
[3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Paris
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] EnhancedVolcano_1.20.0 ggrepel_0.9.6
[3] UpSetR_1.4.0 SingleCellExperiment_1.24.0
[5] SummarizedExperiment_1.32.0 Biobase_2.62.0
[7] GenomicRanges_1.54.1 GenomeInfoDb_1.38.8
[9] IRanges_2.36.0 S4Vectors_0.40.2
[11] BiocGenerics_0.48.1 MatrixGenerics_1.14.0
[13] matrixStats_1.5.0 Seurat_5.3.1
[15] SeuratObject_5.2.0 sp_2.2-0
[17] dplyr_1.1.4 knitr_1.50
[19] rstatix_0.7.3 ggpubr_0.6.2
[21] ggplot2_3.5.2 DT_0.34.0
[23] BiocParallel_1.36.0
loaded via a namespace (and not attached):
[1] RcppAnnoy_0.0.22 splines_4.3.1 later_1.4.4
[4] bitops_1.0-9 tibble_3.3.0 polyclip_1.10-7
[7] fastDummies_1.7.5 lifecycle_1.0.4 Rdpack_2.6.4
[10] globals_0.18.0 lattice_0.22-7 MASS_7.3-60.0.1
[13] MAST_1.28.0 crosstalk_1.2.2 backports_1.5.0
[16] magrittr_2.0.4 limma_3.58.1 plotly_4.11.0
[19] sass_0.4.10 rmarkdown_2.30 plotrix_3.8-4
[22] jquerylib_0.1.4 yaml_2.3.10 qqconf_1.3.2
[25] httpuv_1.6.16 otel_0.2.0 sn_2.1.1
[28] sctransform_0.4.2 spam_2.11-1 spatstat.sparse_3.1-0
[31] reticulate_1.44.0 cowplot_1.2.0 pbapply_1.7-4
[34] RColorBrewer_1.1-3 multcomp_1.4-29 abind_1.4-8
[37] zlibbioc_1.48.2 Rtsne_0.17 presto_1.0.0
[40] purrr_1.1.0 RCurl_1.98-1.17 TH.data_1.1-4
[43] sandwich_3.1-1 GenomeInfoDbData_1.2.11 irlba_2.3.5.1
[46] listenv_0.9.1 spatstat.utils_3.2-0 TFisher_0.2.0
[49] goftest_1.2-3 RSpectra_0.16-2 spatstat.random_3.4-2
[52] fitdistrplus_1.2-4 parallelly_1.45.1 coin_1.4-3
[55] codetools_0.2-20 DelayedArray_0.28.0 tidyselect_1.2.1
[58] farver_2.1.2 rmdformats_1.0.4 spatstat.explore_3.5-3
[61] mathjaxr_1.8-0 jsonlite_2.0.0 multtest_2.58.0
[64] progressr_0.17.0 Formula_1.2-5 ggridges_0.5.7
[67] survival_3.8-3 progress_1.2.3 tools_4.3.1
[70] ica_1.0-3 Rcpp_1.1.0 glue_1.8.0
[73] mnormt_2.1.1 gridExtra_2.3 SparseArray_1.2.4
[76] metap_1.12 DESeq2_1.42.1 xfun_0.54
[79] numDeriv_2016.8-1.1 withr_3.0.2 fastmap_1.2.0
[82] digest_0.6.37 R6_2.6.1 mime_0.13
[85] colorspace_2.1-2 scattermore_1.2 tensor_1.5.1
[88] dichromat_2.0-0.1 spatstat.data_3.1-9 tidyr_1.3.1
[91] generics_0.1.4 data.table_1.17.8 prettyunits_1.2.0
[94] httr_1.4.7 htmlwidgets_1.6.4 S4Arrays_1.2.1
[97] uwot_0.2.3 pkgconfig_2.0.3 gtable_0.3.6
[100] modeltools_0.2-24 lmtest_0.9-40 XVector_0.42.0
[103] htmltools_0.5.8.1 carData_3.0-5 dotCall64_1.2
[106] bookdown_0.45 scales_1.4.0 png_0.1-8
[109] spatstat.univar_3.1-4 rstudioapi_0.17.1 reshape2_1.4.4
[112] nlme_3.1-168 cachem_1.1.0 zoo_1.8-14
[115] stringr_1.5.2 KernSmooth_2.23-26 libcoin_1.0-10
[118] vipor_0.4.7 parallel_4.3.1 miniUI_0.1.2
[121] ggrastr_1.0.2 pillar_1.11.1 grid_4.3.1
[124] vctrs_0.6.5 RANN_2.6.2 promises_1.4.0
[127] car_3.1-3 xtable_1.8-4 cluster_2.1.8.1
[130] beeswarm_0.4.0 evaluate_1.0.5 locfit_1.5-9.12
[133] mvtnorm_1.3-3 cli_3.6.5 compiler_4.3.1
[136] rlang_1.1.6 crayon_1.5.3 mutoss_0.1-13
[139] future.apply_1.20.0 ggsignif_0.6.4 labeling_0.4.3
[142] ggbeeswarm_0.7.2 plyr_1.8.9 stringi_1.8.7
[145] viridisLite_0.4.2 deldir_2.0-4 lazyeval_0.2.2
[148] spatstat.geom_3.6-0 Matrix_1.6-5 RcppHNSW_0.6.0
[151] hms_1.1.4 patchwork_1.3.2 future_1.67.0
[154] statmod_1.5.1 shiny_1.11.1 rbibutils_2.3
[157] ROCR_1.0-11 igraph_2.2.1 broom_1.0.10
[160] bslib_0.9.0